Cardiovascular disease (CVD) is a general term for a group of conditions that includes heart attack, stroke, and heart failure. In 2019, an estimated 17.9 million people died from cardiovascular diseases, and 85% of those deaths were due to heart attack and stroke.
As living standards have improved over the past few decades, maintaining a healthy lifestyle and avoiding serious illness have become priorities, and cardiovascular diseases remain the leading cause of death globally. Our group is therefore interested in several questions: What exactly are cardiovascular diseases? Which factors are correlated with them, and which of those have the strongest correlation? Would a healthy lifestyle help prevent cardiovascular disease? To explore these questions, our group chose this dataset on cardiovascular diseases.
The cardiovascular disease dataset comes from Kaggle (cardio_train.csv). It contains 70,000 patient records with age, height, weight, physical condition, and lifestyle habits. Studying such data can help the public and health organizations identify factors associated with a higher chance of developing cardiovascular disease and develop strategies to reduce the number of patients. Before analyzing the dataset, we researched what each variable represents, why it is included, and which variables we would use in our models. With this information, government agencies and the public can see which habits or lifestyle changes may help prevent cardiovascular disease.
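For reference, a minimal sketch of loading the data (the filename and working directory are assumptions; the Kaggle file is typically semicolon-delimited, so the default comma separator would not work):

```r
# Read the Kaggle CSV; cardio_train.csv uses ";" as the field separator
cardio_vascular_disease <- read.csv("cardio_train.csv", sep = ";")
```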
CVD <- cardio_vascular_disease
CVD$gender <- factor(cardio_vascular_disease$gender)
CVD$smoke <- factor(cardio_vascular_disease$smoke)
CVD$alco <- factor(cardio_vascular_disease$alco)
CVD$active <- factor(cardio_vascular_disease$active)
CVD$cardio <- factor(cardio_vascular_disease$cardio)
CVD$cholesterol <- factor(cardio_vascular_disease$cholesterol)
CVD$gluc <- factor(cardio_vascular_disease$gluc)
CVD$weight <- as.integer(cardio_vascular_disease$weight)
CVD$age <- (cardio_vascular_disease$age/365)
levels(CVD$cardio) <- c("No", "Yes")
levels(CVD$gender) <- c("Female", "Male")
levels(CVD$cholesterol) <- c("Normal", "Above Normal", "Well Above Normal")
levels(CVD$gluc) <- c("Normal", "Above Normal", "Well Above Normal")
levels(CVD$smoke) <- c("No", "Yes")
levels(CVD$alco) <- c("No", "Yes")
levels(CVD$active) <- c("No", "Yes")
CVD <- na.omit(CVD)
str(CVD)
## 'data.frame': 70000 obs. of 13 variables:
## $ id : int 0 1 2 3 4 8 9 12 13 14 ...
## $ age : num 50.4 55.4 51.7 48.3 47.9 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : int 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: Factor w/ 3 levels "Normal","Above Normal",..: 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : Factor w/ 3 levels "Normal","Above Normal",..: 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ active : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 2 2 2 1 ...
## $ cardio : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
The variables are the following:
1. id: Respondent's id | id | int
2. Age: Respondent's age | age | int (years)
3. Height: Respondent's height | height | int (cm)
4. Weight: Respondent's weight | weight | num (kg)
5. Gender: Respondent's gender | gender | int (1: Female, 2: Male)
6. Systolic blood pressure: Respondent's systolic blood pressure | ap_hi | int
7. Diastolic blood pressure: Respondent's diastolic blood pressure | ap_lo | int
8. Cholesterol: Respondent's cholesterol level | cholesterol | int (1: normal, 2: above normal, 3: well above normal)
9. Glucose: Respondent's glucose level | gluc | int (1: normal, 2: above normal, 3: well above normal)
10. Smoking: Whether the respondent smokes | smoke | int (0: Does not smoke, 1: Smokes)
11. Alcohol intake: Whether the respondent drinks | alco | int (0: Does not drink, 1: Drinks)
12. Physical activity: Whether the respondent exercises regularly | active | int (0: Not physically active, 1: Physically active)
13. Presence or absence of cardiovascular disease | cardio | int (0: No CVD currently present, 1: CVD currently present)
For our data analysis, we converted gender, cholesterol, gluc, smoke, alco, active, and cardio into factor variables. We then converted weight to integer and converted age from days to years. Finally, we removed all rows with NA values from the dataset.
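The same preprocessing can be written more compactly with dplyr (a sketch, assuming the dplyr package is available and `cardio_vascular_disease` has been loaded):

```r
library(dplyr)

# Convert categorical columns to factors, weight to integer, age to years,
# and drop incomplete rows in one pipeline
CVD <- cardio_vascular_disease %>%
  mutate(across(c(gender, smoke, alco, active, cardio, cholesterol, gluc), factor),
         weight = as.integer(weight),
         age = age / 365) %>%
  na.omit()
```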
We chose several factors from the dataset and checked whether their distributions differ between respondents with and without CVD. The tables below show counts, and the bar charts that follow show percentages.
We start with smoking history and whether CVD is currently present.
library(dplyr)
smoke_con<-aggregate(CVD$cardio, by=list(CVD$smoke,CVD$cardio), FUN=length)
smoke_con<-rename(smoke_con,Smoke = Group.1,CVD_presence = Group.2,CVD = x)
smoke_con$Smoke[smoke_con$Smoke=="0"] <- 'Not smoke'
smoke_con$Smoke[smoke_con$Smoke=="1"] <- 'Smoke'
smoke_con$CVD_presence[smoke_con$CVD_presence=="0"] <- 'Not Present'
smoke_con$CVD_presence[smoke_con$CVD_presence=="1"] <- 'Present'
xkabledply(smoke_con)
| Smoke | CVD_presence | CVD |
|---|---|---|
| No | No | 31781 |
| Yes | No | 3240 |
| No | Yes | 32050 |
| Yes | Yes | 2929 |
Next, alcohol intake and CVD presence:
alco_con<-aggregate(CVD$cardio,by=list(CVD$alco,CVD$cardio), FUN=length)
alco_con<-rename(alco_con,alco = Group.1,CVD_presence = Group.2,CVD = x)
alco_con$alco[alco_con$alco=="0"] <- 'Not drink'
alco_con$alco[alco_con$alco=="1"] <- 'Drink'
alco_con$CVD_presence[alco_con$CVD_presence=="0"] <- 'Not Present'
alco_con$CVD_presence[alco_con$CVD_presence=="1"] <- 'Present'
xkabledply(alco_con)
| alco | CVD_presence | CVD |
|---|---|---|
| No | No | 33080 |
| Yes | No | 1941 |
| No | Yes | 33156 |
| Yes | Yes | 1823 |
Finally, physical activity and CVD presence:
active_con<-aggregate(CVD$cardio, by=list(CVD$active,CVD$cardio), FUN=length)
active_con<-rename(active_con,active = Group.1,CVD_presence = Group.2,CVD = x)
active_con$active[active_con$active=="0"] <- 'Not Physical Activity'
active_con$active[active_con$active=="1"] <- 'Physical Activity'
active_con$CVD_presence[active_con$CVD_presence=="0"] <- 'Not Present'
active_con$CVD_presence[active_con$CVD_presence=="1"] <- 'Present'
xkabledply(active_con)
| active | CVD_presence | CVD |
|---|---|---|
| No | No | 6378 |
| Yes | No | 28643 |
| No | Yes | 7361 |
| Yes | Yes | 27618 |
CVD-present vs CVD-not present (smoke)
smoke_pic <- data.frame(table(CVD$smoke,CVD$cardio))
names(smoke_pic) <- c("Smoke","Cardio","Count")
ggplot(data = smoke_pic, aes(x = Smoke, y = Count / sum(Count) * 100, fill = Cardio)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(labels = c("0" = "Not smoke", "1" = "Smoke")) +
  scale_fill_discrete(labels = c("0" = "No CVD present", "1" = "CVD present")) +
  ggtitle("Percentage of smoking history and presence of CVD") +
  ylab("Percentage")
CVD-present vs CVD-not present (alcohol)
alco_pic <- data.frame(table(CVD$alco,CVD$cardio))
names(alco_pic) <- c("alco","Cardio","Count")
ggplot(data = alco_pic, aes(x = alco, y = Count / sum(Count) * 100, fill = Cardio)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(labels = c("0" = "Not drink", "1" = "Drink")) +
  scale_fill_discrete(labels = c("0" = "No CVD present", "1" = "CVD present")) +
  ggtitle("Percentage of alcohol intake and presence of CVD") +
  ylab("Percentage")
CVD-present vs CVD-not present (physical activity)
active_pic <- data.frame(table(CVD$active,CVD$cardio))
names(active_pic) <- c("active","Cardio","Count")
ggplot(data = active_pic, aes(x = active, y = Count / sum(Count) * 100, fill = Cardio)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(labels = c("0" = "Workout regularly", "1" = "Not workout regularly")) +
  scale_fill_discrete(labels = c("0" = "No CVD present", "1" = "CVD present")) +
  ggtitle("Percentage of physical activity and presence of CVD") +
  ylab("Percentage")
We used the chi-squared test to test whether each categorical variable is related to cardiovascular disease. We set alpha to 0.1.
Are they independent ?
- \(H_0\): CVD and smoking are
independent.
- \(H_1\): They are not
independent.
smoke_cvd <- table(CVD$smoke, CVD$cardio)  # contingency table of smoking vs. CVD
sctest <- chisq.test(smoke_cvd)
sctest
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: smoke_cvd
## X-squared = 17, df = 1, p-value = 4e-05
Since the p-value = 4e-05 < 0.1, we reject the null hypothesis: smoking is associated with cardio.
Are they independent ?
- \(H_0\): CVD and drinking alcohol are
independent.
- \(H_1\): They are not
independent.
al_cvd <- table(CVD$alco, CVD$cardio)  # contingency table of alcohol intake vs. CVD
alctest <- chisq.test(al_cvd)
alctest
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: al_cvd
## X-squared = 4, df = 1, p-value = 0.05
Since the p-value = 0.05 < 0.1, we reject the null hypothesis: alco is associated with cardio. However, if we set alpha = 0.05, we would fail to reject the null hypothesis.
Are they independent ?
- \(H_0\): CVD and physical activity
are independent.
- \(H_1\): They are not
independent.
act_cvd <- table(CVD$active, CVD$cardio)  # contingency table of physical activity vs. CVD
act_test <- chisq.test(act_cvd)
act_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: act_cvd
## X-squared = 89, df = 1, p-value <2e-16
Since the p-value < 2e-16 < 0.1, we reject the null hypothesis: active is associated with cardio.
A brief overview of the dataset shows three continuous variables: age, height, and weight.
A KDE (kernel density estimate) plot represents the data using a continuous probability density curve in one or more dimensions, which makes it easy to compare the distributions of the two groups.
CVD_new <- CVD
levels(CVD_new$cardio) <- c("No CVD Present", "CVD Present")
cols <- c("No CVD Present" = "green", "CVD Present" = "blue")
str(CVD_new)
age_kdeplot <- ggplot(data = CVD_new, aes(x = age, color =cardio)) +
geom_density(aes(fill = cardio), alpha = 0.8) +
scale_fill_manual(values =cols) +
labs(title="KDEplot for age") +
labs(x="age", y="density") +
theme(legend.position="top")
age_kdeplot
height_kdeplot <- ggplot(data = CVD_new, aes(x = height, color = cardio)) +
geom_density(aes(fill = cardio), alpha = 0.8) +
scale_fill_manual(values =cols) +
labs(title="KDEplot for height") +
labs(x="height", y="density") +
theme(legend.position="top")
height_kdeplot
weight_kdeplot <- ggplot(data = CVD_new, aes(x = weight, color = cardio)) +
geom_density(aes(fill = cardio), alpha = 0.8) +
scale_fill_manual(values= cols) +
labs(title="KDEplot for weight") +
labs(x="weight", y="density") +
theme(legend.position="top")
weight_kdeplot
The KDE plot for age shows that older people are more likely to have CVD, while the KDE plots for height and weight show that people with and without CVD have very similar distributions. Taken together, these three plots suggest that CVD is positively correlated with age, whereas weight appears to contribute only a little to the presence of CVD.
The logit model is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring (such as voted or didn't vote) from a given set of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1 (a binary outcome). In this case, cardio is the binary outcome; as a result, we chose logistic regression for prediction.
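Concretely, the simplest model below specifies the log-odds of having CVD as a linear function of age:

\[
\log\frac{P(\text{cardio}=1)}{1 - P(\text{cardio}=1)} = \beta_0 + \beta_1\,\text{age}
\]

The later models add weight and height as further linear terms on the right-hand side.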
lm1 <- glm(cardio~age, family = binomial(link = "logit"), data = CVD)
lm2 <- glm(cardio~age + weight, family = binomial(link = "logit"), data = CVD)
lm3 <- glm(cardio~age+ height + weight, family = binomial(link = "logit"), data = CVD)
This is the summary of model 1.
xkabledply(lm1, title="summary of model 1")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -3.9280 | 0.0642 | -61.1 | 0 |
| age | 0.0736 | 0.0012 | 61.7 | 0 |
This is the summary of model 2.
xkabledply(lm2, title="summary of model 2")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -5.8346 | 0.0791 | -73.8 | 0 |
| age | 0.0730 | 0.0012 | 60.2 | 0 |
| weight | 0.0262 | 0.0006 | 45.1 | 0 |
This is the summary of model 3.
xkabledply(lm3, title="summary of model 3")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -3.8421 | 0.1788 | -21.5 | 0 |
| age | 0.0716 | 0.0012 | 58.8 | 0 |
| height | -0.0127 | 0.0010 | -12.3 | 0 |
| weight | 0.0285 | 0.0006 | 46.4 | 0 |
anovat <- anova(lm1,lm2,lm3, test="LRT")
anovat
## Analysis of Deviance Table
##
## Model 1: cardio ~ age
## Model 2: cardio ~ age + weight
## Model 3: cardio ~ age + height + weight
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 69998 92985
## 2 69997 90810 1 2175 <2e-16 ***
## 3 69996 90656 1 154 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As shown above, each added term has a small p-value in the likelihood-ratio test, so model 3 fits best.
We can use the ROC curve and its AUC to compare the models. The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is a performance measure for classification problems across all threshold settings: the ROC is a probability curve, and the AUC summarizes how well the model separates the classes, i.e., how well it predicts 0s as 0 and 1s as 1. In short, the higher the AUC, the better the model.
For model 1:
prob <- predict(lm1,CVD, type = c("response"))
CVD$prob <- prob
library(pROC)
g<- roc(cardio~prob, data = CVD)
plot(g, main = "ROC curve of model 1")
auc(CVD$cardio, prob)
## Area under the curve: 0.635
The AUC value is 0.635 < 0.8, so we do not consider this a good fit.
For model 2:
prob <- predict(lm2,CVD, type = c("response"))
CVD$prob <- prob
library(pROC)
g<- roc(cardio~prob, data = CVD)
plot(g, main = "ROC curve of model 2")
auc(CVD$cardio, prob)
## Area under the curve: 0.668
The AUC value is 0.668. Since this is below 0.8, the model is still not good enough.
For model 3:
prob1 <- predict(lm3,CVD, type = c("response"))
CVD$prob1 <- prob1
library(pROC)
g<- roc(cardio~prob1, data = CVD)
plot(g, main = "ROC curve of model 3")
auc(CVD$cardio, prob1)
## Area under the curve: 0.671
The AUC value is 0.671. It is still below 0.8, so the model is not good enough, but it is better than model 2 (0.668).
Among the three continuous variables, cardio is positively correlated with age and weight and negatively correlated with height. This means a person who is heavier and shorter has a higher probability of having cardiovascular disease. Compared with age, however, height and weight add relatively little to the prediction.
From the results above, some of the variables have a significant effect on CVD. Next, we graph the groups of variables with relatively large correlation coefficients one by one to explore how they relate to cardiovascular disease.
Since most of the variables are factors, we check their Spearman correlations.
First, we convert all the categorical variables to numeric.
CVDNum = CVD
for(i in 1:13){
# age, height, weight
if (!(i %in% c(2, 4, 5))){
CVDNum[,i] = as.numeric(CVDNum[,i])
}
}
CVDNum$prob1 <- NULL
CVDNum$prob <- NULL
str(CVDNum)
## 'data.frame': 70000 obs. of 13 variables:
## $ id : num 0 1 2 3 4 8 9 12 13 14 ...
## $ age : num 50.4 55.4 51.7 48.3 47.9 ...
## $ gender : num 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : int 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : num 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : num 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: num 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : num 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : num 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : num 1 1 1 1 1 1 1 1 1 1 ...
## $ active : num 2 2 1 2 1 1 2 2 2 1 ...
## $ cardio : num 1 2 2 2 1 1 1 2 1 1 ...
CVDcor <- cor(subset(CVDNum, select=-id), method="spearman")
loadPkg("corrplot")
corrplot(CVDcor, type="lower", addCoef.col="black", number.cex=0.5, tl.cex=0.7,title="Spearman Correlation for CVD", mar=c(0,0,1,0))
A larger circle means a stronger correlation. CVD is negatively correlated with height, meaning shorter people are somewhat more likely to have cardiovascular disease, and positively correlated with cholesterol, so people with high cholesterol levels are more likely to have CVD. Cholesterol and smoking are also positively correlated with each other, which is consistent with people who smoke tending to have higher cholesterol levels.
library(patchwork)  # for the a3 / (a1 + a2) layout below
CVD_factor <- CVD_new  # CVD_factor is not defined earlier; we assume the factor-labeled copy
a1 <- ggplot(CVD_factor, aes(x = smoke, y = age, fill = cardio)) +
geom_violin(alpha = 0.5, aes(linetype=NA)) +
xlab("Smoke") + ylab("Age")
a2 <- ggplot(CVD_factor,aes(x = alco , y = age , fill = cardio)) +
geom_violin(alpha = 0.5, aes(linetype=NA)) +
xlab("Alco") + ylab("Age")
a3 <- ggplot(CVD_factor, aes(x=height, y=age, color= cardio)) +
geom_point(size=.5,alpha=0.4)
a3/ (a1+a2) + plot_annotation(title = 'Age Plot')
First, age and height show no clear correlation.
Second, for smoking, the groups who smoke and are older have higher CVD rates.
Finally, for alcohol, people who drink and are older are also more prone to CVD.
h1 <- ggplot(CVD_factor,aes(x = active , y = height , fill = cardio)) +
geom_violin(alpha = 0.5, aes(linetype=NA)) +
xlab("Active") + ylab("Height")
h2 <- ggplot(CVD_factor, aes(x=weight, y=height, color= cardio)) +
geom_point(size=0.5,alpha=0.4)
h1/ h2 + plot_annotation(title = 'Height Plot')
First, for active, the physically active groups have slightly higher CVD rates in this sample.
Second, weight and height show a slight positive correlation.
w1 <- ggplot(CVD_factor,aes(x = active , y = weight , fill = cardio)) +
geom_violin(alpha = 0.5, aes(linetype=NA)) +
xlab("Active") + ylab("Weight")
w2 <- ggplot(CVD_factor,aes(x = alco , y = weight , fill = cardio)) +
geom_violin(alpha = 0.5, aes(linetype=NA)) +
xlab("Alco") + ylab("Weight")
w1/ w2 + plot_annotation(title = 'Weight Plot')
First, for active, the physically active groups have slightly higher CVD rates.
Second, for alcohol, the non-drinking group ("No") has slightly higher CVD rates.
The first model is logistic regression. We take cardio as the dependent variable and set alpha to 0.05. The remaining 11 variables serve as independent variables, and we iteratively remove variables with p-value > 0.05 from the model summary.
df <- CVD
df$id <- NULL
df$prob1 <- NULL
df$prob <- NULL
levels(df$cardio) <- c("No", "Yes")
levels(df$gender) <- c("Female", "Male")
levels(df$cholesterol) <- c("Normal", "Above Normal", "Well Above Normal")
levels(df$gluc) <- c("Normal", "Above Normal", "Well Above Normal")
levels(df$smoke) <- c("No", "Yes")
levels(df$alco) <- c("No", "Yes")
levels(df$active) <- c("No", "Yes")
df <- na.omit(df)
str(df)
First model covering all variables:
logistic_all <- glm(cardio ~ age + gender + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, data = df, family = "binomial")
summary(logistic_all)
xkabledply(logistic_all, title = "Summary of Logistic Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -8.0841 | 0.2213 | -36.534 | 0.0000 |
| age | 0.0542 | 0.0013 | 41.735 | 0.0000 |
| genderMale | 0.0143 | 0.0211 | 0.678 | 0.4974 |
| height | -0.0056 | 0.0012 | -4.568 | 0.0000 |
| weight | 0.0152 | 0.0007 | 23.024 | 0.0000 |
| ap_hi | 0.0395 | 0.0006 | 65.235 | 0.0000 |
| ap_lo | 0.0003 | 0.0001 | 4.460 | 0.0000 |
| cholesterolAbove Normal | 0.4222 | 0.0259 | 16.286 | 0.0000 |
| cholesterolWell Above Normal | 1.1341 | 0.0344 | 32.928 | 0.0000 |
| glucAbove Normal | 0.0301 | 0.0344 | 0.876 | 0.3811 |
| glucWell Above Normal | -0.3388 | 0.0381 | -8.894 | 0.0000 |
| smokeYes | -0.1314 | 0.0332 | -3.957 | 0.0001 |
| alcoYes | -0.1695 | 0.0403 | -4.210 | 0.0000 |
| activeYes | -0.2101 | 0.0210 | -9.981 | 0.0000 |
With p-value = 0.4974, we remove the gender variable.
Second Model:
logistic_1 <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, data = df, family = "binomial")
summary(logistic_1)
xkabledply(logistic_1, title = "Summary of Logistic Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -8.1439 | 0.2030 | -40.121 | 0.0000 |
| age | 0.0542 | 0.0013 | 41.760 | 0.0000 |
| height | -0.0053 | 0.0011 | -4.761 | 0.0000 |
| weight | 0.0152 | 0.0007 | 23.024 | 0.0000 |
| ap_hi | 0.0395 | 0.0006 | 65.330 | 0.0000 |
| ap_lo | 0.0003 | 0.0001 | 4.464 | 0.0000 |
| cholesterolAbove Normal | 0.4218 | 0.0259 | 16.274 | 0.0000 |
| cholesterolWell Above Normal | 1.1337 | 0.0344 | 32.922 | 0.0000 |
| glucAbove Normal | 0.0301 | 0.0344 | 0.874 | 0.3821 |
| glucWell Above Normal | -0.3389 | 0.0381 | -8.899 | 0.0000 |
| smokeYes | -0.1256 | 0.0321 | -3.914 | 0.0001 |
| alcoYes | -0.1681 | 0.0402 | -4.180 | 0.0000 |
| activeYes | -0.2101 | 0.0210 | -9.980 | 0.0000 |
With p-value = 0.3821, we remove the glucose variable.
Third Model:
logistic_2 <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + smoke + alco + active, data = df, family = "binomial")
summary(logistic_2)
xkabledply(logistic_2, title = "Summary of Logistic Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -8.1242 | 0.2028 | -40.06 | 0e+00 |
| age | 0.0539 | 0.0013 | 41.55 | 0e+00 |
| height | -0.0054 | 0.0011 | -4.86 | 0e+00 |
| weight | 0.0152 | 0.0007 | 23.02 | 0e+00 |
| ap_hi | 0.0396 | 0.0006 | 65.49 | 0e+00 |
| ap_lo | 0.0003 | 0.0001 | 4.48 | 0e+00 |
| cholesterolAbove Normal | 0.4234 | 0.0250 | 16.96 | 0e+00 |
| cholesterolWell Above Normal | 0.9855 | 0.0296 | 33.27 | 0e+00 |
| smokeYes | -0.1222 | 0.0320 | -3.81 | 1e-04 |
| alcoYes | -0.1641 | 0.0401 | -4.09 | 0e+00 |
| activeYes | -0.2085 | 0.0210 | -9.91 | 0e+00 |
With all p-value being small, we accept this model as our final
model.
Based on the results, we can say:
- age, weight, systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), and cholesterol have a positive impact on the log-odds of having cardiovascular disease.
- height, smoking, alcohol, and physical activity have a negative impact on the log-odds of having cardiovascular disease.
expcoef = exp(coef(logistic_2))
summary(expcoef)
xkabledply( as.table(expcoef), title = "Exponential of coefficients in Cardio")
| x | |
|---|---|
| (Intercept) | 0.0003 |
| age | 1.0553 |
| height | 0.9947 |
| weight | 1.0153 |
| ap_hi | 1.0404 |
| ap_lo | 1.0003 |
| cholesterolAbove Normal | 1.5272 |
| cholesterolWell Above Normal | 2.6792 |
| smokeYes | 0.8850 |
| alcoYes | 0.8487 |
| activeYes | 0.8118 |
For the exponentiated coefficients:
- Each additional year of age multiplies the odds of CVD by 1.0553.
- Each additional centimeter of height multiplies the odds by 0.9947 (a decrease).
- Each additional kilogram of weight multiplies the odds by 1.0153.
- Each unit increase in systolic blood pressure multiplies the odds by 1.0404.
- Each unit increase in diastolic blood pressure multiplies the odds by 1.0003.
- Above-normal cholesterol, compared with normal cholesterol, multiplies the odds by 1.5272; well-above-normal cholesterol multiplies them by 2.6792.
- Smoking, compared with not smoking, multiplies the odds by 0.8850; drinking, compared with not drinking, by 0.8487; and being physically active, compared with not, by 0.8118.
loadPkg("regclass")
confusion_matrix(logistic_2)
cfmatrix1 = confusion_matrix(logistic_2)
accuracy1 <- (cfmatrix1[1,1]+cfmatrix1[2,2])/cfmatrix1[3,3]
precision1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[1,2])
recall1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[2,1])
specificity1 <- cfmatrix1[1,1]/(cfmatrix1[1,1]+cfmatrix1[1,2])
F1_score1 <- 2*(precision1)*(recall1)/(precision1 + recall1)
accuracy1
precision1
recall1
specificity1
F1_score1
xkabledply( confusion_matrix(logistic_2), title = "Confusion matrix from Logit Model" )
From confusion matrix, we conclude that:
- Accuracy = 0.722
- Precision = 0.743
- Recall = 0.677
- Specificity = 0.767
- F1 score = 0.708
loadPkg("pROC")
prob <- predict(logistic_2, type = "response")
df$prob <- prob
h <- roc(cardio ~ prob, data = df)
auc(h)
## Area under the curve: 0.785
plot(h)
The area under the curve is 0.785; since this is still below 0.8, the fit is not good.
Feature selection based on adjusted R^2.
loadPkg("leaps")
reg.leaps <- regsubsets(cardio ~ ., data = df, nbest = 1, method = "exhaustive")
plot(reg.leaps, scale = "adjr2", main = "Adjusted R^2")
Feature selection based on BIC.
plot(reg.leaps, scale = "bic", main = "BIC")
Feature selection based on Cp.
plot(reg.leaps, scale = "Cp", main = "Cp")
All three methods have similar results.
We can use the K-nearest-neighbors (KNN) algorithm to predict whether a patient will get CVD by looking at the K neighbors who share the most similar attributes.
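The `chooseK()` helper called below is not defined in this document; a minimal sketch consistent with how it is used (assuming the `class` package supplies `knn()`) is:

```r
library(class)

# Fit KNN for a given k and return the k used together with the
# accuracy on the validation set, as a 2-element vector (the shape
# the sapply() loops below expect)
chooseK <- function(k, train_set, val_set, train_class, val_class) {
  pred <- knn(train = train_set, test = val_set, cl = train_class, k = k)
  accu <- mean(pred == val_class)
  c(k = k, accuracy = accu)
}
```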
df$prob <- NULL
str(df)
df_num <- df
for(i in 1:12){
if (!(i %in% c(1,3:6))){
df_num[,i] = as.numeric(df_num[,i])
}
}
str(df_num)
scale1 <- subset(df_num, select = -cardio)
scale1$cardio <- df$cardio
str(scale1)
scale1[c(1,3,4,5,6)] <- as.data.frame(scale(scale1[c(1,3,4,5,6)], center = TRUE, scale = TRUE))
set.seed(1)
df_sample <- sample(2, nrow(scale1), replace = TRUE, prob = c(0.75,0.25))
df_train <- scale1[df_sample==1, 1:11]
df_test <- scale1[df_sample==2, 1:11]
df.trainLabels <- df[df_sample==1, 12]
df.testLabels <- df[df_sample==2, 12]
str(df_train)
## 'data.frame': 52430 obs. of 11 variables:
## $ age : num -0.436 0.308 -0.248 -0.809 1.263 ...
## $ gender : num 2 1 1 1 2 1 1 1 2 2 ...
## $ height : num 0.443 -1.018 0.078 -1.018 1.661 ...
## $ weight : num -0.848 0.75 -0.709 -1.265 1.445 ...
## $ ap_hi : num -0.12218 0.07261 0.00768 -0.18711 0.00768 ...
## $ ap_lo : num -0.0882 -0.0352 -0.1413 -0.1944 -0.0352 ...
## $ cholesterol: num 1 3 3 1 3 1 1 1 1 1 ...
## $ gluc : num 1 1 1 1 3 1 1 1 1 1 ...
## $ smoke : num 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : num 1 1 1 1 1 1 1 1 1 1 ...
## $ active : num 2 2 1 1 2 2 1 2 2 1 ...
str(df_test)
## 'data.frame': 17570 obs. of 11 variables:
## $ age : num -0.748 0.991 1.072 -2.001 -1.103 ...
## $ gender : num 2 1 1 2 1 2 2 2 2 2 ...
## $ height : num 0.565 -1.627 -0.896 2.027 -0.775 ...
## $ weight : num 0.542 -0.5 1.306 1.445 -1.542 ...
## $ ap_hi : num 0.13754 -0.05725 0.00768 0.00768 -0.12218 ...
## $ ap_lo : num 0.0179 -0.0882 -0.0882 -0.0352 -0.1413 ...
## $ cholesterol: num 1 2 3 1 1 1 1 1 3 1 ...
## $ gluc : num 1 2 1 1 3 1 1 1 1 1 ...
## $ smoke : num 1 1 1 2 1 2 1 1 1 1 ...
## $ alco : num 1 1 1 2 1 1 1 1 1 1 ...
## $ active : num 2 1 2 2 2 2 2 2 1 2 ...
knn_different_k = sapply(seq(1, 21, by = 2),
function(x) chooseK(x,
train_set = df_train,
val_set = df_test,
train_class = df.trainLabels,
val_class = df.testLabels))
knn_different_k = data.frame(k = knn_different_k[1,],
accuracy = knn_different_k[2,])
library("ggplot2")
ggplot(knn_different_k,
aes(x = k, y = accuracy)) +
geom_line(color = "orange", size = 1.5) +
geom_point(size = 3) +
labs(title = "accuracy vs k")
xkabledply((knn_different_k))
| k | accuracy |
|---|---|
| 1 | 0.610 |
| 3 | 0.635 |
| 5 | 0.644 |
| 7 | 0.651 |
| 9 | 0.658 |
| 11 | 0.659 |
| 13 | 0.660 |
| 15 | 0.660 |
| 17 | 0.665 |
| 19 | 0.662 |
| 21 | 0.665 |
k=17 is a decent choice; beyond k=17, the accuracy only fluctuates up and down.
pred <- knn(train = df_train, test = df_test, cl=df.trainLabels, k=17)
pred
loadPkg("gmodels")
churnPredCross <- CrossTable(df.testLabels, pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 17570
##
##
## | pred
## df.testLabels | No | Yes | Row Total |
## --------------|-----------|-----------|-----------|
## No | 6040 | 2641 | 8681 |
## | 0.696 | 0.304 | 0.494 |
## | 0.651 | 0.319 | |
## | 0.344 | 0.150 | |
## --------------|-----------|-----------|-----------|
## Yes | 3245 | 5644 | 8889 |
## | 0.365 | 0.635 | 0.506 |
## | 0.349 | 0.681 | |
## | 0.185 | 0.321 | |
## --------------|-----------|-----------|-----------|
## Column Total | 9285 | 8285 | 17570 |
## | 0.528 | 0.472 | |
## --------------|-----------|-----------|-----------|
##
##
Over the 17,570 test observations, the model correctly predicted that 6,040 patients do not have CVD and that 5,644 patients have CVD.
- Accuracy = 0.6648
- Precision = 0.6809
- Recall = 0.6348
- Specificity = 0.6954
- F1 score = 0.657
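These metrics follow directly from the counts in the CrossTable above; a short sketch of the arithmetic, with "Yes" (CVD present) treated as the positive class:

```r
# Counts from the cross table: rows = truth, columns = prediction
TN <- 6040; FP <- 2641   # true "No" predicted No / Yes
FN <- 3245; TP <- 5644   # true "Yes" predicted No / Yes

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # ~0.665
precision   <- TP / (TP + FP)                    # ~0.681
recall      <- TP / (TP + FN)                    # ~0.635
specificity <- TN / (TN + FP)                    # ~0.696
F1          <- 2 * precision * recall / (precision + recall)  # ~0.657
```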
Based on previous EDA and regression model, we removed
gender and gluc.
df_num2 <- subset(df_num, select = -c(gender, gluc, prob))
str(df_num2)
scale2 <- df_num2
scale2[1:5] <- scale(df_num2[1:5], center = TRUE, scale = TRUE)
scale2$cardio <- df$cardio
str(scale2)
set.seed(1)
df_sample2 <- sample(2, nrow(scale2), replace = TRUE, prob = c(0.75,0.25))
df_train2 <- scale2[df_sample2==1, 1:9]
df_test2 <- scale2[df_sample2==2, 1:9]
df.trainLabel2 <- scale2[df_sample2==1, 10]
df.testLabel2 <- scale2[df_sample2==2, 10]
str(df_train2)
## 'data.frame': 52430 obs. of 9 variables:
## $ age : num -0.436 0.308 -0.248 -0.809 1.263 ...
## $ height : num 0.443 -1.018 0.078 -1.018 1.661 ...
## $ weight : num -0.848 0.75 -0.709 -1.265 1.445 ...
## $ ap_hi : num -0.12218 0.07261 0.00768 -0.18711 0.00768 ...
## $ ap_lo : num -0.0882 -0.0352 -0.1413 -0.1944 -0.0352 ...
## $ cholesterol: num 1 3 3 1 3 1 1 1 1 1 ...
## $ smoke : num 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : num 1 1 1 1 1 1 1 1 1 1 ...
## $ active : num 2 2 1 1 2 2 1 2 2 1 ...
str(df_test2)
## 'data.frame': 17570 obs. of 9 variables:
## $ age : num -0.748 0.991 1.072 -2.001 -1.103 ...
## $ height : num 0.565 -1.627 -0.896 2.027 -0.775 ...
## $ weight : num 0.542 -0.5 1.306 1.445 -1.542 ...
## $ ap_hi : num 0.13754 -0.05725 0.00768 0.00768 -0.12218 ...
## $ ap_lo : num 0.0179 -0.0882 -0.0882 -0.0352 -0.1413 ...
## $ cholesterol: num 1 2 3 1 1 1 1 1 3 1 ...
## $ smoke : num 1 1 1 2 1 2 1 1 1 1 ...
## $ alco : num 1 1 1 2 1 1 1 1 1 1 ...
## $ active : num 2 1 2 2 2 2 2 2 1 2 ...
knn_different_k2 = sapply(seq(1, 21, by = 2),
function(x) chooseK(x,
train_set = df_train2,
val_set = df_test2,
train_class = df.trainLabel2,
val_class = df.testLabel2))
str(knn_different_k2)
## num [1:2, 1:11] 1 0.612 3 0.642 5 ...
knn_different_k2 = data.frame(k = knn_different_k2[1,],
accuracy = knn_different_k2[2,])
library("ggplot2")
ggplot(knn_different_k2,
aes(x = k, y = accuracy)) +
geom_line(color = "orange", size = 1.5) +
geom_point(size = 3) +
labs(title = "accuracy vs k")
xkabledply((knn_different_k2))
| k | accuracy |
|---|---|
| 1 | 0.612 |
| 3 | 0.642 |
| 5 | 0.654 |
| 7 | 0.660 |
| 9 | 0.663 |
| 11 | 0.668 |
| 13 | 0.672 |
| 15 | 0.670 |
| 17 | 0.670 |
| 19 | 0.671 |
| 21 | 0.670 |
We should select value k=13 here as it has the highest
accuracy.
pred2 <- knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13)
knn.roc.prob <- attr(knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13, prob = T), 'prob')
pred2
churnPredCross2 <- CrossTable(df.testLabel2, pred2, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 17570
##
##
## | pred2
## df.testLabel2 | No | Yes | Row Total |
## --------------|-----------|-----------|-----------|
## No | 6085 | 2596 | 8681 |
## | 0.701 | 0.299 | 0.494 |
## | 0.657 | 0.312 | |
## | 0.346 | 0.148 | |
## --------------|-----------|-----------|-----------|
## Yes | 3172 | 5717 | 8889 |
## | 0.357 | 0.643 | 0.506 |
## | 0.343 | 0.688 | |
## | 0.181 | 0.325 | |
## --------------|-----------|-----------|-----------|
## Column Total | 9257 | 8313 | 17570 |
## | 0.527 | 0.473 | |
## --------------|-----------|-----------|-----------|
##
##
Accuracy = 0.6717, Precision = 0.6877, Recall = 0.6432, Specificity = 0.7010, F1 score = 0.6616.
All the statistics have improved slightly compared with the previous KNN model.
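These statistics can be reproduced directly from the cross-table counts above, taking "Yes" as the positive class:

```r
# Counts copied from the CrossTable output above ("Yes" = positive class)
TP <- 5717; FN <- 3172   # actual Yes
TN <- 6085; FP <- 2596   # actual No

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)          # sensitivity
specificity <- TN / (TN + FP)

round(c(accuracy, precision, recall, specificity), 4)
# [1] 0.6717 0.6877 0.6432 0.7010
```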
k13 <- knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13, prob = TRUE)
prob <- attr(k13, "prob")                     # vote share of the winning class
prob <- ifelse(k13 == "Yes", prob, 1 - prob)  # convert to P(class = "Yes")
loadPkg("pROC")
roc(df.testLabel2, prob)
plot(roc(df.testLabel2, prob), print.thres = T, print.auc = T)
tab <- matrix(c(0.6583, 0.6733, 0.6308, 0.6866, 0.6514, 0.6717, 0.6877, 0.6432, 0.701, 0.6616, 0.722, 0.743, 0.677, 0.767, 0.708), ncol = 5, byrow = TRUE)
colnames(tab) <- c("Accuracy", "Precision", "Recall", "Specificity", "F1 Score")
rownames(tab) <- c("KNN with all variables", "KNN with selected variables", "Logistic with selected variables")
tab <- as.table(tab)
xkabledply(tab, "Models Comparison")
| Accuracy | Precision | Recall | Specificity | F1 Score | |
|---|---|---|---|---|---|
| KNN with all variables | 0.658 | 0.673 | 0.631 | 0.687 | 0.651 |
| KNN with selected variables | 0.672 | 0.688 | 0.643 | 0.701 | 0.662 |
| Logistic with selected variables | 0.722 | 0.743 | 0.677 | 0.767 | 0.708 |
Overall, logistic regression has the best performance among the three models.
First, we convert all the variables to numeric, except cardio.
str(df)
df_num$cardio <- factor(CVD$cardio)
levels(df_num$cardio) <- c("No", "Yes")
str(df_num)
Then we fit a random forest and use the MeanDecreaseGini measure to rank the importance of all the variables.
library(randomForest)
fit_im = randomForest(cardio ~ ., data = df_num)
importance(fit_im)
## MeanDecreaseGini
## age 5947
## gender 401
## height 3053
## weight 3477
## ap_hi 5822
## ap_lo 2847
## cholesterol 1273
## gluc 485
## smoke 247
## alco 214
## active 352
varImpPlot(fit_im)
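The ranking behind the variable selection below can be seen by sorting the MeanDecreaseGini values; they are copied from the importance output above so the snippet runs stand-alone:

```r
# MeanDecreaseGini values copied from the importance(fit_im) output above
gini <- c(age = 5947, gender = 401, height = 3053, weight = 3477,
          ap_hi = 5822, ap_lo = 2847, cholesterol = 1273, gluc = 485,
          smoke = 247, alco = 214, active = 352)
sort(gini, decreasing = TRUE)
```

Age and the two blood-pressure variables dominate, followed by weight, height, and cholesterol.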
Based on the MeanDecreaseGini, we select age, weight, diastolic blood pressure (ap_lo), height, and cholesterol as the variables for the classification tree model. We excluded systolic blood pressure (ap_hi) because of its high correlation with diastolic blood pressure.
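That collinearity claim can be checked directly with cor(). The call on the real data is shown as a comment (it assumes the CVD data frame loaded earlier); the snippet itself uses synthetic blood-pressure-like data so it runs stand-alone:

```r
# Synthetic blood-pressure-like data (illustration only, not the CVD data)
set.seed(42)
ap_hi <- rnorm(1000, mean = 120, sd = 15)
ap_lo <- 0.6 * ap_hi + rnorm(1000, sd = 5)   # strongly coupled, as in the data

cor(ap_hi, ap_lo, method = "spearman")       # close to 1

# On the real data the check would be:
# cor(CVD$ap_hi, CVD$ap_lo, method = "spearman")
```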
Model summary with varying maximum depths:
loadPkg("rpart")
loadPkg("caret")
# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame(Depth = numeric(0), Accuracy = numeric(0), Sensitivity = numeric(0), Specificity = numeric(0),
Pos.Pred.Value = numeric(0), Neg.Pred.Value = numeric(0), Precision = numeric(0), Recall = numeric(0),
F1 = numeric(0), Prevalence = numeric(0), Detection.Rate = numeric(0), Detection.Prevalence = numeric(0),
Balanced.Accuracy = numeric(0), row.names = NULL)
for (deep in 2:15) {
kfit <- rpart(cardio ~ age + weight + ap_lo + height + cholesterol, data = df_num, method="class", control = list(maxdepth = deep) )
#
cm = confusionMatrix( predict(kfit, type = "class"), reference = df_num[, "cardio"] ) # from caret library
#
cmaccu = cm$overall['Accuracy']
# print( paste("Total Accuracy = ", cmaccu ) )
#
cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics
cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
# print("Other metrics : ")
}
unloadPkg("caret")
xkabledply(confusionMatrixResultDf, title = "Cardio Classification Trees summary with varying MaxDepth")
| Depth | Accuracy | Sensitivity | Specificity | Pos.Pred.Value | Neg.Pred.Value | Precision | Recall | F1 | Prevalence | Detection.Rate | Detection.Prevalence | Balanced.Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.670 | 0.582 | 0.757 | 0.706 | 0.644 | 0.706 | 0.582 | 0.638 | 0.5 | 0.291 | 0.413 | 0.670 |
| 3 | 0.689 | 0.831 | 0.546 | 0.647 | 0.764 | 0.647 | 0.831 | 0.728 | 0.5 | 0.416 | 0.643 | 0.689 |
| 4 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 5 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 6 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 7 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 8 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 9 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 10 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 11 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 12 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 13 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 14 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
| 15 | 0.702 | 0.771 | 0.632 | 0.677 | 0.734 | 0.677 | 0.771 | 0.721 | 0.5 | 0.386 | 0.570 | 0.702 |
We choose MaxDepth = 4 for the classification tree.
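The plateau from depth 4 onward in the table above is expected behaviour rather than an error: rpart's default complexity parameter (cp = 0.01) stops splitting before larger maxdepth values can take effect, so all trees deeper than 4 end up identical. A stand-alone sketch on synthetic data (not the CVD data) shows that lowering cp lets maxdepth matter again:

```r
library(rpart)

# Synthetic two-class data (illustration only)
set.seed(1)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(500, sd = 0.5) > 0, "Yes", "No"))

# Default cp = 0.01: pruning, not maxdepth, limits the tree size
fit_default <- rpart(y ~ ., data = d, method = "class",
                     control = rpart.control(maxdepth = 10))
# Lower cp: maxdepth becomes the binding constraint, the tree grows deeper
fit_deep <- rpart(y ~ ., data = d, method = "class",
                  control = rpart.control(maxdepth = 10, cp = 0.001))

c(default_nodes = nrow(fit_default$frame), low_cp_nodes = nrow(fit_deep$frame))
```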
set.seed(1000)
cardiofit <- rpart(cardio ~ age + weight + ap_lo + height + cholesterol, data = df_num, method = "class", control = list(maxdepth=4))
printcp(cardiofit)
##
## Classification tree:
## rpart(formula = cardio ~ age + weight + ap_lo + height + cholesterol,
## data = df_num, method = "class", control = list(maxdepth = 4))
##
## Variables actually used in tree construction:
## [1] age ap_lo cholesterol
##
## Root node error: 34979/70000 = 0.5
##
## n= 70000
##
## CP nsplit rel error xerror xstd
## 1 0.32 0 1.0 1.0 0.004
## 2 0.02 1 0.7 0.7 0.004
## 3 0.01 4 0.6 0.6 0.003
## 4 0.01 5 0.6 0.6 0.003
plotcp(cardiofit)
summary(cardiofit)
## Call:
## rpart(formula = cardio ~ age + weight + ap_lo + height + cholesterol,
## data = df_num, method = "class", control = list(maxdepth = 4))
## n= 70000
##
## CP nsplit rel error xerror xstd
## 1 0.3223 0 1.000 1.005 0.00378
## 2 0.0220 1 0.678 0.678 0.00358
## 3 0.0149 4 0.612 0.612 0.00349
## 4 0.0100 5 0.597 0.597 0.00346
##
## Variable importance
## ap_lo age cholesterol weight
## 66 20 13 1
##
## Node number 1: 70000 observations, complexity param=0.322
## predicted class=No expected loss=0.5 P(node) =1
## class counts: 35021 34979
## probabilities: 0.500 0.500
## left son=2 (49369 obs) right son=3 (20631 obs)
## Primary splits:
## ap_lo < 85.5 to the left, improve=4380.0, (0 missing)
## cholesterol < 1.5 to the left, improve=1480.0, (0 missing)
## age < 55 to the left, improve=1350.0, (0 missing)
## weight < 75.5 to the left, improve= 834.0, (0 missing)
## height < 156 to the right, improve= 16.1, (0 missing)
## Surrogate splits:
## weight < 98.5 to the left, agree=0.708, adj=0.01, (0 split)
## height < 58 to the right, agree=0.705, adj=0.00, (0 split)
##
## Node number 2: 49369 observations, complexity param=0.022
## predicted class=No expected loss=0.385 P(node) =0.705
## class counts: 30343 19026
## probabilities: 0.615 0.385
## left son=4 (28886 obs) right son=5 (20483 obs)
## Primary splits:
## age < 55.1 to the left, improve=1160.00, (0 missing)
## cholesterol < 2.5 to the left, improve=1070.00, (0 missing)
## ap_lo < 73.5 to the left, improve= 422.00, (0 missing)
## weight < 79.5 to the left, improve= 326.00, (0 missing)
## height < 156 to the right, improve= 6.55, (0 missing)
## Surrogate splits:
## cholesterol < 2.5 to the left, agree=0.609, adj=0.059, (0 split)
## height < 150 to the right, agree=0.587, adj=0.005, (0 split)
## weight < 39.5 to the right, agree=0.585, adj=0.000, (0 split)
## ap_lo < 80.5 to the left, agree=0.585, adj=0.000, (0 split)
##
## Node number 3: 20631 observations
## predicted class=Yes expected loss=0.227 P(node) =0.295
## class counts: 4678 15953
## probabilities: 0.227 0.773
##
## Node number 4: 28886 observations, complexity param=0.0149
## predicted class=No expected loss=0.294 P(node) =0.413
## class counts: 20390 8496
## probabilities: 0.706 0.294
## left son=8 (27295 obs) right son=9 (1591 obs)
## Primary splits:
## cholesterol < 2.5 to the left, improve=460.00, (0 missing)
## age < 43.8 to the left, improve=228.00, (0 missing)
## ap_lo < 77.5 to the left, improve=221.00, (0 missing)
## weight < 79.5 to the left, improve=192.00, (0 missing)
## height < 164 to the left, improve= 5.15, (0 missing)
##
## Node number 5: 20483 observations, complexity param=0.022
## predicted class=Yes expected loss=0.486 P(node) =0.293
## class counts: 9953 10530
## probabilities: 0.486 0.514
## left son=10 (17692 obs) right son=11 (2791 obs)
## Primary splits:
## cholesterol < 2.5 to the left, improve=357.00, (0 missing)
## age < 61.1 to the left, improve=225.00, (0 missing)
## weight < 72.5 to the left, improve= 91.10, (0 missing)
## ap_lo < 78.5 to the left, improve= 79.90, (0 missing)
## height < 156 to the right, improve= 3.61, (0 missing)
##
## Node number 8: 27295 observations
## predicted class=No expected loss=0.273 P(node) =0.39
## class counts: 19855 7440
## probabilities: 0.727 0.273
##
## Node number 9: 1591 observations
## predicted class=Yes expected loss=0.336 P(node) =0.0227
## class counts: 535 1056
## probabilities: 0.336 0.664
##
## Node number 10: 17692 observations, complexity param=0.022
## predicted class=No expected loss=0.477 P(node) =0.253
## class counts: 9253 8439
## probabilities: 0.523 0.477
## left son=20 (12583 obs) right son=21 (5109 obs)
## Primary splits:
## age < 60.7 to the left, improve=183.00, (0 missing)
## ap_lo < 72.5 to the left, improve= 66.70, (0 missing)
## weight < 79.5 to the left, improve= 50.90, (0 missing)
## cholesterol < 1.5 to the left, improve= 22.30, (0 missing)
## height < 182 to the right, improve= 1.36, (0 missing)
## Surrogate splits:
## ap_lo < 0.5 to the right, agree=0.711, adj=0, (0 split)
##
## Node number 11: 2791 observations
## predicted class=Yes expected loss=0.251 P(node) =0.0399
## class counts: 700 2091
## probabilities: 0.251 0.749
##
## Node number 20: 12583 observations
## predicted class=No expected loss=0.431 P(node) =0.18
## class counts: 7158 5425
## probabilities: 0.569 0.431
##
## Node number 21: 5109 observations
## predicted class=Yes expected loss=0.41 P(node) =0.073
## class counts: 2095 3014
## probabilities: 0.410 0.590
Here is our classification tree:
plot(cardiofit, uniform = TRUE, main = "Classification Tree for Cardio")
text(cardiofit, use.n = TRUE, all = TRUE, cex = 1)
The same tree with a different look:
loadPkg("rpart.plot")
loadPkg("rattle")
rpart.plot(cardiofit)
fancyRpartPlot(cardiofit)
loadPkg("caret")
cm = confusionMatrix( predict (cardiofit, type = "class"), reference = df_num[, "cardio"])
print('Overall: ')
## [1] "Overall: "
cm$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.02e-01 4.04e-01 6.98e-01 7.05e-01 5.00e-01
## AccuracyPValue McnemarPValue
## 0.00e+00 1.14e-247
print('Class')
## [1] "Class"
cm$byClass
## Sensitivity Specificity Pos Pred Value
## 0.771 0.632 0.677
## Neg Pred Value Precision Recall
## 0.734 0.677 0.771
## F1 Prevalence Detection Rate
## 0.721 0.500 0.386
## Detection Prevalence Balanced Accuracy
## 0.570 0.702
xkabledply(cm$table, "confusion matrix")
| No | Yes | |
|---|---|---|
| No | 27013 | 12865 |
| Yes | 8008 | 22114 |
library(rpart)
rp <- rpart(cardio ~., data = df_num)
library(ROCR)
pred <- prediction(predict(cardiofit, type = "prob")[,2], df_num$cardio)
tree.predict.prob <- predict(cardiofit, type = "prob")[,2]
plot(performance(pred, "tpr", "fpr"), main = "ROC Cardio")
auc = performance(pred, 'auc')
slot(auc, 'y.values')
## [[1]]
## [1] 0.735
tab1 <- matrix(c(0.672, 0.688, 0.643, 0.701, 0.662, 0.722, 0.743, 0.677, 0.767, 0.708, 0.702, 0.677, 0.771, 0.632, 0.721), ncol = 5, byrow = TRUE)
colnames(tab1) <- c("Accuracy", "Precision", "Recall", "Specificity", "F1 Score")
rownames(tab1) <- c("KNN with selected variables", "Logistic with selected variables", "Classification Tree with selected variables")
tab1 <- as.table(tab1)
xkabledply(tab1, "Model Comparison")
| Accuracy | Precision | Recall | Specificity | F1 Score | |
|---|---|---|---|---|---|
| KNN with selected variables | 0.672 | 0.688 | 0.643 | 0.701 | 0.662 |
| Logistic with selected variables | 0.722 | 0.743 | 0.677 | 0.767 | 0.708 |
| Classification Tree with selected variables | 0.702 | 0.677 | 0.771 | 0.632 | 0.721 |
We conducted research on cardiovascular diseases. The main question was: does a healthy lifestyle work?
After analyzing the dataset, we came to the following conclusions. As stated earlier, we studied a set of data on the factors affecting cardiovascular disease (CVD): smoking, alcohol intake, physical activity, age, weight, height, blood pressure, and cholesterol.
According to the EDA, CVD is present in those who do not smoke, do not drink, and do not work out. According to the chi-square tests, smoking, alcohol intake, and physical activity have an impact on CVD. According to the KDE plot, older people and people with a weight close to 100 kg tend to have CVD. According to the Spearman correlation, cholesterol does not have a strong impact on CVD.
We also built logistic regression, KNN, and classification tree models. The logistic regression showed that age, weight, blood pressure, and cholesterol have an impact on CVD. The KNN model showed that age and weight influence CVD. The classification tree showed that age, weight, height, blood pressure, and cholesterol have an impact on CVD.
From these conclusions we can describe the characteristics of the CVD group: no smoking, no drinking, no physical activity, older age, and a weight close to 100 kg. This suggests that a healthy lifestyle does not really work in this case.
Combined with the above analysis, CVD is mainly due to age and weight.
Some suggestions for the high-risk CVD group would be to work out and lose weight: according to our results, a person can smoke and drink, but has to do some physical activity to lose weight.